Hypothesis Testing

PSCI 2270 - Week 9

Georgiy Syunyaev

Department of Political Science, Vanderbilt University

October 22, 2024

Plan for this week



  1. Hypothesis testing

  2. Discussion of two papers

Hypothesis testing

Answering questions with data

  • We conduct a thought experiment to see whether our results could have occurred by chance

  • Question: What would the world look like if we knew the truth?

    • Average treatment effect is \(0\)
    • Each individual effect is \(0\)
    • Sample mean is equal to \(X\)
  • Examples:

    • Biden’s support poll shows 40% now and it was 42% before. Did support decrease by 2 percentage points, or could this difference have arisen purely by chance?
    • We encouraged a random sample of people to go to protests and observed that their support for government policies is on average lower than among those who were not encouraged. Could this difference be produced by random chance?

Hypothesis testing



  • Hypothesis test: (a) Assume some value other than what you observe is true, (b) determine what the data would look like in that world, (c) compare this to what you observe
  1. Pose your null and alternative hypotheses

  2. Generate the data assuming null is true OR say something about its distribution

  3. Calculate a probability called a \(p\)-value by comparing outcomes under the null with what you observe

  4. Use \(p\)-value to decide whether to reject the null hypothesis or not
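The four steps above can be sketched in R with a toy example (the numbers here are hypothetical: suppose we observe 60 heads in 100 coin flips and ask whether the coin is fair):

```r
set.seed(42)
# Toy example (hypothetical data): 60 heads observed in 100 coin flips
heads_observed <- 60

# 1. Null: the coin is fair, Pr(heads) = 0.5; alternative: Pr(heads) != 0.5
# 2. Generate data assuming the null is true
heads_null <- rbinom(10000, size = 100, prob = 0.5)

# 3. p-value: share of null draws at least as far from 50 as what we observed
p_value <- mean(abs(heads_null - 50) >= abs(heads_observed - 50))

# 4. Reject the null if the p-value is below a chosen threshold (e.g. 0.05)
p_value
```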

Null and alternative hypothesis

  • Null hypothesis: Some statement about the population parameters.

    • The “devil’s advocate” hypothesis \(\Rightarrow\) assumes what you seek to prove wrong
    • Ex: Biden’s approval is the same as election result
    • Ex: Treatment effect is zero for everyone
    • Denoted \(H_0\)
  • Alternative hypothesis: The statement we hope or suspect is true instead of \(H_0\).

    • It is the opposite of the null hypothesis
    • Ex: Biden’s approval fell
    • Ex: Treatment effect is different from zero (positive or negative)
    • Denoted \(H_1\) or \(H_a\)
  • Probabilistic proof by contradiction: Try to disprove the null

Practicum example


  • Parameter: Average Treatment Effect (ATE) \(\mu_T − \mu_C\) of encouragement on support for government policies

    • \(\mu_T\): Average support for government policies if everyone received encouragement
    • \(\mu_C\): Average support for government policies if no one received encouragement
  • Goal: Learn about the difference between average support for government policies between those who were encouraged and those who were not.
  • (Sharp) Null hypothesis: No treatment effect for anyone

    • \(H_0\): \(Y_i(1) − Y_i(0) = 0\) for all \(i\)
    • \(H_1\): \(Y_i(1) − Y_i(0) \neq 0\) for at least some \(i\) (two-sided alternative)
    • In words: Do the treatment and control potential outcomes differ for at least one unit?
  • Other null hypothesis: No average treatment effect

\(p\)-value



  • \(p\)-value (based on a two-sided test): Probability of getting an (absolute) difference in means this big (or bigger) if the null hypothesis were true

    • Lower \(p\)-values \(\Rightarrow\) stronger evidence against the null
    • Intuition: How likely are we to observe what we observe if the null hypothesis is true?
  • Conclusion: We either reject (if \(p\)-value is small) or fail to reject (if \(p\)-value is large) the null

    • We never accept the null, since the statement is probabilistic

How to calculate?

  • Using CLT as we did in the practicum

    • \(SE(\bar{Y}_{\text{treated}} - \bar{Y}_{\text{untreated}}) = \sqrt{\frac{\sigma_{\text{treated}}^2}{n_{\text{treated}}} + \frac{\sigma_{\text{untreated}}^2}{n_{\text{untreated}}}}\)
    • See how many SEs away from null hypothesis is our observed difference-in-means and use normal distribution properties to calculate \(p\)-value
  • Using the OLS regression

    • In the model \(Y = \beta_0 + \beta_1 Z + \epsilon\), choose \(\beta_0\) and \(\beta_1\) such that \(\sum \epsilon^2\) is as small as possible
    • \(\beta_1\) will be the same as the difference in means, and its SE is proportional to the SD of \(\epsilon_i\) and inversely proportional to the SD of \(Z_i\)
    • See how many SEs away from null hypothesis is our observed difference-in-means and use normal distribution properties to calculate \(p\)-value
  • Or we can do Randomization Inference (RI):

    • Assume the sharp null of no effect, redraw the treatment assignment many times, and calculate the difference-in-means each time
    • See how many times we get the difference-in-means as extreme as the one we observed
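The OLS claims above (slope equals the difference in means; SE driven by the SDs of \(\epsilon_i\) and \(Z_i\)) can be checked numerically. A minimal sketch with simulated data (the variable names and simulation setup are my own, not from the practicum):

```r
set.seed(1)
n <- 100
Z <- rbinom(n, 1, 0.5)    # simulated treatment assignment
Y <- 0.5 * Z + rnorm(n)   # simulated outcome

fit <- lm(Y ~ Z)

# The OLS slope equals the difference in means
dim <- mean(Y[Z == 1]) - mean(Y[Z == 0])
stopifnot(all.equal(unname(coef(fit)["Z"]), dim))

# SE of the slope = residual SD / (SD of Z * sqrt(n - 1))
se_formula <- summary(fit)$sigma / (sd(Z) * sqrt(n - 1))
se_lm <- summary(fit)$coefficients["Z", "Std. Error"]
stopifnot(all.equal(se_formula, se_lm))
```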

RI: Observed Practicum Data

Participant ID   Z (Invited to protest)   Y (Support for policies)   Y(0) (Support if not invited)   Y(1) (Support if invited)
1 1 0.859 NA 0.859
2 1 1.930 NA 1.930
3 1 0.875 NA 0.875
4 0 2.944 2.944 NA
5 0 -1.015 -1.015 NA
6 0 -0.064 -0.064 NA
7 0 1.624 1.624 NA
8 0 -0.411 -0.411 NA
9 1 1.048 NA 1.048
10 1 -0.282 NA -0.282
  • The observed difference in means is 0.2704

  • For the next step we need to reconstruct the schedule of potential outcomes:

    • What do we substitute for NA if the sharp null hypothesis is true?
    • Is it the same as null of no average effect?

RI: Practicum Data under Sharp Null

Participant ID   Z (Invited to protest)   Y (Support for policies)   Y(0) (Support if not invited)   Y(1) (Support if invited)
1 1 0.859 0.859 0.859
2 1 1.930 1.93 1.93
3 1 0.875 0.875 0.875
4 0 2.944 2.944 2.944
5 0 -1.015 -1.015 -1.015
6 0 -0.064 -0.064 -0.064
7 0 1.624 1.624 1.624
8 0 -0.411 -0.411 -0.411
9 1 1.048 1.048 1.048
10 1 -0.282 -0.282 -0.282
  • Now we can permute the treatment and calculate a new difference-in-means

Permutation 1

Participant ID   Z (Invited to protest)   Y (Support for policies)   Y(0) (Support if not invited)   Y(1) (Support if invited)
1 1 0.859 0.859 0.859
2 0 1.930 1.93 1.93
3 0 0.875 0.875 0.875
4 1 2.944 2.944 2.944
5 0 -1.015 -1.015 -1.015
6 1 -0.064 -0.064 -0.064
7 0 1.624 1.624 1.624
8 0 -0.411 -0.411 -0.411
9 1 1.048 1.048 1.048
10 1 -0.282 -0.282 -0.282

Permutation 2

Participant ID   Z (Invited to protest)   Y (Support for policies)   Y(0) (Support if not invited)   Y(1) (Support if invited)
1 1 0.859 0.859 0.859
2 0 1.930 1.93 1.93
3 1 0.875 0.875 0.875
4 0 2.944 2.944 2.944
5 0 -1.015 -1.015 -1.015
6 1 -0.064 -0.064 -0.064
7 1 1.624 1.624 1.624
8 0 -0.411 -0.411 -0.411
9 1 1.048 1.048 1.048
10 0 -0.282 -0.282 -0.282

Permutation 3

Participant ID   Z (Invited to protest)   Y (Support for policies)   Y(0) (Support if not invited)   Y(1) (Support if invited)
1 1 0.859 0.859 0.859
2 0 1.930 1.93 1.93
3 0 0.875 0.875 0.875
4 0 2.944 2.944 2.944
5 0 -1.015 -1.015 -1.015
6 1 -0.064 -0.064 -0.064
7 1 1.624 1.624 1.624
8 1 -0.411 -0.411 -0.411
9 0 1.048 1.048 1.048
10 1 -0.282 -0.282 -0.282

Permutation 10

Participant ID   Z (Invited to protest)   Y (Support for policies)   Y(0) (Support if not invited)   Y(1) (Support if invited)
1 1 0.859 0.859 0.859
2 0 1.930 1.93 1.93
3 1 0.875 0.875 0.875
4 0 2.944 2.944 2.944
5 1 -1.015 -1.015 -1.015
6 0 -0.064 -0.064 -0.064
7 1 1.624 1.624 1.624
8 1 -0.411 -0.411 -0.411
9 0 1.048 1.048 1.048
10 0 -0.282 -0.282 -0.282

Permutation 100

Participant ID   Z (Invited to protest)   Y (Support for policies)   Y(0) (Support if not invited)   Y(1) (Support if invited)
1 0 0.859 0.859 0.859
2 1 1.930 1.93 1.93
3 1 0.875 0.875 0.875
4 0 2.944 2.944 2.944
5 1 -1.015 -1.015 -1.015
6 1 -0.064 -0.064 -0.064
7 0 1.624 1.624 1.624
8 1 -0.411 -0.411 -0.411
9 0 1.048 1.048 1.048
10 0 -0.282 -0.282 -0.282

Permutation 1000

Participant ID   Z (Invited to protest)   Y (Support for policies)   Y(0) (Support if not invited)   Y(1) (Support if invited)
1 1 0.859 0.859 0.859
2 0 1.930 1.93 1.93
3 1 0.875 0.875 0.875
4 1 2.944 2.944 2.944
5 0 -1.015 -1.015 -1.015
6 0 -0.064 -0.064 -0.064
7 0 1.624 1.624 1.624
8 1 -0.411 -0.411 -0.411
9 1 1.048 1.048 1.048
10 0 -0.282 -0.282 -0.282

Permutation 10000

Participant ID   Z (Invited to protest)   Y (Support for policies)   Y(0) (Support if not invited)   Y(1) (Support if invited)
1 0 0.859 0.859 0.859
2 1 1.930 1.93 1.93
3 0 0.875 0.875 0.875
4 0 2.944 2.944 2.944
5 0 -1.015 -1.015 -1.015
6 1 -0.064 -0.064 -0.064
7 1 1.624 1.624 1.624
8 1 -0.411 -0.411 -0.411
9 0 1.048 1.048 1.048
10 1 -0.282 -0.282 -0.282

  • What conclusion do we make?

Resulting \(p\)-values are similar!

Z <- c(1    , 1    , 1    , 0    , 0     , 0     , 0    , 0     , 1    , 1)

Y <- c(0.859, 1.930, 0.875, 2.944, -1.015, -0.064, 1.624, -0.411, 1.048, -0.282)
Under CLT:
dim_observed <- mean(Y[Z == 1]) - mean(Y[Z == 0])

se <- 
  sqrt(
    var(Y[Z == 1])/length(Y[Z == 1]) + 
      var(Y[Z == 0])/length(Y[Z == 0])
  )

c(estimate = dim_observed,
  p.value = 2 * pnorm(q = abs(dim_observed), mean = 0, sd = se, lower.tail = FALSE))
 estimate   p.value 
0.2704000 0.7382426 
OLS:
stats::lm(Y ~ Z) |> 
  summary() |> 
  broom::tidy() |> 
  dplyr::filter(term == "Z") |>
  dplyr::select(estimate, p.value)
# A tibble: 1 × 2
  estimate p.value
     <dbl>   <dbl>
1    0.270   0.747
Randomization inference:
dim_observed <- mean(Y[Z == 1]) - mean(Y[Z == 0])

dim_simulated <- 
  base::replicate(1000, {
    Z_simulated <- sample(Z)
    mean(Y[Z_simulated == 1]) - mean(Y[Z_simulated == 0])
  })

c(estimate = dim_observed,
  p.value = mean(abs(dim_simulated) >= abs(dim_observed)))
estimate  p.value 
  0.2704   0.7430 

Testing errors and false-positives

Testing errors

  • A \(p\)-value of \(0.05\) says that data this extreme would only happen in 5% of repeated samples if the null were true.

    • In other words our test results are not always correct!
    • 5% of the time we’ll reject the null when it is actually true.
  • Test results vs reality:
                              \(H_0\) is True    \(H_0\) is False
    \(H_0\) is not rejected   Great!             Type II error
    \(H_0\) is rejected       Type I error       Amazing!
  • Type I error (false positive) is the worst: like convicting an innocent person
  • Type II error (false negative) is less serious: we missed out on an awesome finding
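The 5% Type I error rate can be seen by simulation. A sketch (the sample sizes, seed, and number of replications are arbitrary choices):

```r
set.seed(123)
# Under the null (both groups drawn from the same distribution),
# a 5%-level t-test should falsely reject about 5% of the time
rejections <- replicate(5000, {
  y1 <- rnorm(30)
  y0 <- rnorm(30)
  t.test(y1, y0)$p.value < 0.05
})
mean(rejections)  # close to 0.05
```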

College sports 🏈 and elections 🗳️


  • Irrelevant events affect voters’ evaluations of government performance by Healy, Malhotra, and Mo (2010)

  • Summary:

    • Previous research finds that natural disasters and other unpredictable events have an effect on incumbent politicians’ support
    • Authors argue that even events completely unrelated to politics can have an effect on incumbent’s support
    • Consider the case of college football and basketball games
    • Mix of observational and survey experimental evidence
    • Several robustness exercises

Digging into argument


  • Should we expect events unrelated to politics to have an effect on voting?

    • Do government investments and response matter?
    • Do psychological factors matter?
    • What else could matter?
  • Let’s break into groups and then draw a possible causal diagram

Study 1: College Football and Elections



  • What are the dependent and independent variables?

  • How are those operationalized?

  • What are the tests they are running?
  • Why do they need so many tests?

Controls


  • What does it mean to control for the factors (covariates) in estimation? Why would we do that?
  • We control for the factors to remove the variation in dependent variable that is explained by them

    • This allows us to say that the part of variation we explain with our independent variable is not explained by the factor
    • In other words, this addresses the possibility that the factor is a confounder (i.e., causes both the independent and dependent variables)
  • Common controls: Demographics, lagged outcomes, fixed effects, etc.

Robustness and Placebo tests


  • What is a robustness test?
  • The idea is to subset or augment the data so that we can run another test of the same expectation

    • Re-running analyses with covariates counts as a robustness test
    • Each additional test that supports our theory increases our confidence in that theory
  • What is a placebo? And what is a placebo test?
  • The purpose of a placebo test is to show that our variables do not vary with factors irrelevant to our theory

    • Variant 1: An irrelevant independent variable does not predict the outcome
    • Variant 2: The independent variable does not predict irrelevant outcomes

Evidence


Study 2: Survey experiment


  • What are the dependent and independent variables?

  • How are those operationalized?

  • What are the tests they are running?

    • And why do we need to run this?

Evidence


Story of false-positives 😱


  • College football, elections, and false-positive results in observational research by Fowler and Montagnes (2015)

Summary:

  • Healy, Malhotra, and Mo (2010) did a great job and the paper has been influential, BUT
  • …their results could have been produced by chance (spurious correlation) \(\Rightarrow\) false positive
  • The approach is to take similar data and run additional robustness checks (akin to replicating an experiment)
  • Additional robustness checks fail to support the original theory

How do they show this

  • What are the dependent and independent variables? How are they operationalized?

  • What tests do they run? Why do they run the same regression initially?

How do they explain these results


  • Fowler and Montagnes (2015) acknowledge that Healy, Malhotra, and Mo (2010) also ran robustness/placebo tests, BUT:
  1. Placebo tests do not provide evidence against a false positive

  2. Including controls (demographics and fixed effects) is not a fully independent test and cannot guard against a false positive

  3. The explanation for why effects 10 days before the election would be stronger than effects 3 days before could suggest ex-post theory adjustment (wow…)

  4. They run multiple tests of each hypothesis but do not adjust for multiple comparisons (Bonferroni correction)

  5. Championships are a bad proxy for interest, and the high-attendance results fail to replicate

💀⚰️⏹️

Broader implications for research design



  • Think about all possible implications of theory and see if you can test them with your or other available data

    • This is especially important for observational studies
  • Operationalization of variables is crucial!

  • If you run many tests of the same hypothesis, you need to adjust for multiple comparisons, since you can obtain significant results just by chance
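The multiple-comparisons point can be made concrete with a back-of-the-envelope calculation (20 tests is an arbitrary illustration):

```r
alpha <- 0.05
m <- 20   # number of independent tests of true nulls (illustrative)

# Chance of at least one false positive across m independent tests
1 - (1 - alpha)^m        # roughly 0.64

# Bonferroni correction: test each hypothesis at alpha / m
1 - (1 - alpha / m)^m    # back below alpha (about 0.049)
```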

References

Fowler, Anthony, and B. Pablo Montagnes. 2015. “College Football, Elections, and False-Positive Results in Observational Research.” Proceedings of the National Academy of Sciences 112 (45): 13800–13804. https://doi.org/10.1073/pnas.1502615112.
Healy, Andrew J., Neil Malhotra, and Cecilia Hyunjung Mo. 2010. “Irrelevant Events Affect Voters’ Evaluations of Government Performance.” Proceedings of the National Academy of Sciences 107 (29): 12804–12809.